# html-to-text

When ingesting HTML documents for later retrieval, we are often interested only in the actual content of the webpage rather than the semantics of its markup.
Stripping HTML tags from documents with the `HtmlToTextTransformer` can result in more content-rich chunks, making retrieval more effective.

## Setup

You'll need to install the [`html-to-text`](https://www.npmjs.com/package/html-to-text) npm package:

```bash npm2yarn
npm install html-to-text
```

Though not required for the transformer itself, the usage example below requires [`cheerio`](https://www.npmjs.com/package/cheerio) for scraping:

```bash npm2yarn
npm install cheerio
```

import IntegrationInstallTooltip from "@mdx_components/integration_install_tooltip.mdx";

<IntegrationInstallTooltip></IntegrationInstallTooltip>

```bash npm2yarn
npm install @langchain/community @langchain/core
```

## Usage

The example below scrapes a Hacker News thread, splits it on HTML tags so that chunks follow the semantic structure the tags convey,
and then extracts the text content from each chunk:

import CodeBlock from "@theme/CodeBlock";
import Example from "@examples/document_transformers/html_to_text.ts";

<CodeBlock language="typescript">{Example}</CodeBlock>

## Customization

You can pass the transformer's constructor any [options accepted by the `html-to-text` package](https://www.npmjs.com/package/html-to-text) to customize how HTML is converted to text.
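
As a minimal sketch of what that can look like, the snippet below passes a few `html-to-text` options to the transformer. The option values (`wordwrap`, and the `a` and `img` selectors) are illustrative choices rather than recommended defaults, and the sample document and the `stripSample` helper are hypothetical, inlined so the snippet runs without scraping:

```typescript
import { HtmlToTextTransformer } from "@langchain/community/document_transformers/html_to_text";
import { Document } from "@langchain/core/documents";

// The options object is forwarded to `html-to-text`.
// These particular values are assumptions chosen for illustration.
const transformer = new HtmlToTextTransformer({
  wordwrap: false, // do not hard-wrap long lines
  selectors: [
    // Keep the link text but drop the URL that would otherwise follow it.
    { selector: "a", options: { ignoreHref: true } },
    // Omit images from the output entirely.
    { selector: "img", format: "skip" },
  ],
});

// Hypothetical sample input standing in for scraped or split documents.
const sampleDocs = [
  new Document({
    pageContent:
      '<p>Read the <a href="https://example.com/docs">docs</a>.<img src="logo.png" alt="logo"></p>',
  }),
];

// Hypothetical helper: runs the transformer over the sample documents.
async function stripSample(): Promise<Document[]> {
  return transformer.transformDocuments(sampleDocs);
}

stripSample().then((docs) => console.log(docs[0].pageContent));
```

Because the options are applied to every document the transformer processes, settings such as `ignoreHref` are worth checking against your retrieval needs: dropping link URLs gives cleaner text but also removes the URLs from the indexed content.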
